Deep Neural Machine Translation with Weakly-Recurrent Units
Recurrent neural networks (RNNs) have for years represented the state of the
art in neural machine translation. Recently, new architectures have been
proposed that can leverage parallel computation on GPUs better than classical
RNNs. Faster training and inference, combined with different
sequence-to-sequence modeling, also lead to performance improvements. While the
new models completely depart from the original recurrent architecture, we
decided to investigate how to make RNNs more efficient. In this work, we
propose a new recurrent NMT architecture, called Simple Recurrent NMT, built on
a class of fast and weakly-recurrent units that use layer normalization and
multiple attentions. Our experiments on the WMT14 English-to-German and WMT16
English-Romanian benchmarks show that our model represents a valid alternative
to LSTMs, as it can achieve better results at a significantly lower
computational cost.

Comment: 10 pages, 3 figures, accepted as a conference paper at the 21st
Annual Conference of the European Association for Machine Translation (EAMT)
2018
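The paper's exact weakly-recurrent cell is not given in the abstract; as a rough illustration, the sketch below implements an SRU-style unit in which the only sequential dependency is an elementwise gate interpolation (so all gate inputs can be precomputed in parallel over the sequence), with layer normalization applied to the state. The gate parameterization and names are illustrative assumptions, not the paper's cell.

```python
import math

def layer_norm(v, eps=1e-5):
    # Normalize a vector to zero mean and unit variance.
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def weakly_recurrent_layer(candidates, gate_preacts):
    """Elementwise recurrence: candidates and gate pre-activations are
    computed from the input alone (parallelizable); only the cheap
    per-dimension interpolation below runs sequentially."""
    state = [0.0] * len(candidates[0])
    outputs = []
    for x, f in zip(candidates, gate_preacts):
        g = [sigmoid(fi) for fi in f]
        state = [gi * si + (1.0 - gi) * xi for gi, si, xi in zip(g, state, x)]
        outputs.append(layer_norm(state))
    return outputs
```

Because the matrix multiplies that produce `candidates` and `gate_preacts` have no dependence on the previous hidden state, they can be batched across timesteps, which is where the speedup over an LSTM comes from.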
Fine-tuning on Clean Data for End-to-End Speech Translation: FBK @ IWSLT 2018
This paper describes FBK's submission to the end-to-end English-German speech
translation task at IWSLT 2018. Our system relies on a state-of-the-art model
based on LSTMs and CNNs, where the CNNs are used to reduce the temporal
dimension of the audio input, which is in general much higher than machine
translation input. Our model was trained only on the audio-to-text parallel
data released for the task, and fine-tuned on cleaned subsets of the original
training corpus. The addition of weight normalization and label smoothing
improved the baseline system by 1.0 BLEU point on our validation set. The final
submission also featured checkpoint averaging within a training run and
ensemble decoding of models trained during multiple runs. On test data, our
best single model obtained a BLEU score of 9.7, while the ensemble obtained a
BLEU score of 10.24.

Comment: 6 pages, 2 figures, system description at the 15th International
Workshop on Spoken Language Translation (IWSLT) 2018
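The temporal reduction the abstract mentions can be illustrated with a strided 1-D convolution: each stride-2 layer roughly halves the number of audio frames before they reach the encoder. This is a minimal single-channel sketch, not the system's actual CNN configuration.

```python
def strided_conv1d(seq, kernel, stride=2):
    """Slide a kernel over a 1-D sequence with the given stride;
    stride=2 roughly halves the temporal dimension."""
    k = len(kernel)
    out = []
    for start in range(0, len(seq) - k + 1, stride):
        out.append(sum(w * x for w, x in zip(kernel, seq[start:start + k])))
    return out
```

Stacking two such layers reduces a sequence of, say, 1000 audio frames to about 250, bringing the input length closer to that of machine translation.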
One-To-Many Multilingual End-to-end Speech Translation
Nowadays, training end-to-end neural models for spoken language translation
(SLT) still has to confront extreme data scarcity. The existing
SLT parallel corpora are indeed orders of magnitude smaller than those
available for the closely related tasks of automatic speech recognition (ASR)
and machine translation (MT), which usually comprise tens of millions of
instances. To cope with data paucity, in this paper we explore the
effectiveness of transfer learning in end-to-end SLT by presenting a
multilingual approach to the task. Multilingual solutions are widely studied in
MT and usually rely on "target forcing", in which multilingual
parallel data are combined to train a single model by prepending to the input
sequences a language token that specifies the target language. However, when
tested in speech translation, our experiments show that MT-like target
forcing, used as is, is not effective in discriminating among the target
languages. Thus, we propose a variant that uses target-language embeddings to
shift the input representations in different portions of the space according to
the language, so as to better support the production of output in the desired
target language. Our experiments on end-to-end SLT from English into six
languages show important improvements when translating into similar languages,
especially when these are supported by scarce data. Further improvements are
obtained when using English ASR data as an additional language (up to
BLEU points).

Comment: 8 pages, one figure, version accepted at ASRU 2019
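The two strategies contrasted in the abstract can be sketched side by side: MT-style target forcing prepends a language token to the input sequence, while the proposed variant adds a target-language embedding to every input vector, shifting the representations into a language-specific region of the space. Function names and shapes here are illustrative.

```python
def target_forcing_prepend(src_tokens, lang_token):
    # MT-style target forcing: a token specifying the target language
    # is prepended to the input sequence.
    return [lang_token] + src_tokens

def target_forcing_shift(src_vectors, lang_embedding):
    """Proposed variant: add a learned target-language embedding to every
    input frame, shifting all representations for that language into the
    same portion of the space."""
    return [[x + e for x, e in zip(v, lang_embedding)]
            for v in src_vectors]
```

With speech input there is no discrete source vocabulary to prepend a token to, which is one intuition for why the additive shift applied to every frame discriminates target languages better than a single prefix token.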
Take the Hint: Improving Arabic Diacritization with Partially-Diacritized Text
Automatic Arabic diacritization is useful in many applications, ranging from
reading support for language learners to accurate pronunciation prediction for
downstream tasks like speech synthesis. While most previous work has focused
on models that operate on raw non-diacritized text, production systems
can gain accuracy by first letting humans partly annotate ambiguous words. In
this paper, we propose 2SDiac, a multi-source model that can effectively
support optional diacritics in input to inform all predictions. We also
introduce Guided Learning, a training scheme to leverage given diacritics in
input with different levels of random masking. We show that the provided hints
during test affect more output positions than those annotated. Moreover,
experiments on two common benchmarks show that our approach i) greatly
outperforms the baseline even when evaluated on non-diacritized text; and ii)
achieves state-of-the-art results while reducing the parameter count by over
60%.

Comment: Arabic text diacritization, partially-diacritized text, Arabic
natural language processing
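The Guided Learning scheme described above can be illustrated as input corruption: during training, each given diacritic is kept with some probability and masked otherwise, so the model learns to exploit hints at any coverage level (including none). The placeholder symbol and API are assumptions for illustration.

```python
import random

def mask_diacritics(diacritics, keep_prob, rng=None):
    """Guided Learning-style corruption: keep each input diacritic with
    probability keep_prob, replace the rest with a placeholder so the
    model sees every level of partial annotation during training."""
    rng = rng or random.Random(0)
    return [d if rng.random() < keep_prob else "<no-diac>"
            for d in diacritics]
```

Varying `keep_prob` across batches exposes the model to fully diacritized, partially diacritized, and raw input, matching the test-time scenario where humans annotate only ambiguous words.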
Robust Neural Machine Translation for Clean and Noisy Speech Transcripts
Neural machine translation models have been shown to achieve high quality when trained and fed with well-structured and punctuated input texts. Unfortunately, the latter condition is not met in spoken language translation, where the input is generated by an automatic speech recognition (ASR) system. In this paper, we study how to adapt a strong NMT system to make it robust to typical ASR errors. As in our application scenarios transcripts might be post-edited by human experts, we propose adaptation strategies to train a single system that can translate either clean or noisy input with no supervision on the input type. Our experimental results on a public speech translation data set show that adapting a model on a significant amount of parallel data including ASR transcripts is beneficial with test data of the same type, but produces a small degradation when translating clean text. Adapting on both clean and noisy variants of the same data leads to the best results on both input types.
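The best-performing adaptation strategy above amounts to fine-tuning one model on the union of clean and ASR-transcript variants of the same parallel data, with no label indicating the input type. A minimal sketch of that data preparation step (the function name is illustrative):

```python
import random

def mixed_adaptation_data(clean_pairs, noisy_pairs, seed=0):
    """Build a fine-tuning set from clean and ASR-transcript variants of
    the same parallel data. No input-type tag is attached, so a single
    model learns to translate both clean and noisy input."""
    data = list(clean_pairs) + list(noisy_pairs)
    random.Random(seed).shuffle(data)
    return data
```

Because the model never sees which variant it is translating, it cannot rely on an input-type signal at inference time either, which is the "no supervision on the input type" property the abstract describes.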
End-to-End Speech-Translation with Knowledge Distillation: FBK@IWSLT2020
This paper describes FBK's participation in the IWSLT 2020 offline speech
translation (ST) task. The task evaluates systems' ability to translate English
TED talks audio into German texts. The test talks are provided in two versions:
one contains the data already segmented with automatic tools and the other is
the raw data without any segmentation. Participants can decide whether to work
on custom segmentation or not. We used the provided segmentation. Our system is
an end-to-end model based on an adaptation of the Transformer for speech data.
Its training process is the main focus of this paper and it is based on: i)
transfer learning (ASR pretraining and knowledge distillation), ii) data
augmentation (SpecAugment, time stretch and synthetic data), iii) combining
synthetic and real data marked as different domains, and iv) multi-task
learning using the CTC loss. Finally, after the training with word-level
knowledge distillation is complete, our ST models are fine-tuned using label
smoothed cross entropy. Our best model scored 29 BLEU on the MuST-C En-De test
set, which is an excellent result compared to recent papers, and 23.7 BLEU on
the same data segmented with VAD, showing the need for researching solutions
addressing this specific data condition.

Comment: Accepted at IWSLT 2020
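The two training objectives named in the abstract can be written out in a minimal form: word-level knowledge distillation is a cross-entropy between the student's per-token distribution and the MT teacher's, and the final fine-tuning uses label-smoothed cross entropy. This is one common parameterization of label smoothing; the paper's exact formulation may differ.

```python
import math

def word_kd_loss(student_probs, teacher_probs):
    """Word-level knowledge distillation: cross-entropy of the student's
    per-token distribution against the MT teacher's soft targets."""
    return -sum(t * math.log(s)
                for t, s in zip(teacher_probs, student_probs))

def label_smoothed_ce(student_probs, target_idx, eps=0.1):
    """Label-smoothed cross entropy: the true token keeps mass 1 - eps
    plus its uniform share; the remaining eps is spread over the vocab."""
    smooth = eps / len(student_probs)
    loss = 0.0
    for i, s in enumerate(student_probs):
        t = (1.0 - eps) + smooth if i == target_idx else smooth
        loss -= t * math.log(s)
    return loss
```

With a one-hot teacher, the KD loss reduces to plain cross entropy; the soft teacher distribution is what transfers the MT model's knowledge to the ST student.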
On Target Segmentation for Direct Speech Translation
Recent studies on direct speech translation show continuous improvements by
means of data augmentation techniques and bigger deep learning models. While
these methods are helping to close the gap between this new approach and the
more traditional cascaded one, there are many incongruities among different
studies that make it difficult to assess the state of the art. Surprisingly,
one point of discussion is the segmentation of the target text. Character-level
segmentation was initially proposed to obtain an open vocabulary, but it
results in long sequences and long training times. Then, subword-level
segmentation became the state of the art in neural machine translation as it
produces shorter sequences that reduce the training time, while being superior
to word-level models. As such, recent works on speech translation started using
target subwords despite the initial use of characters and some recent claims of
better results at the character level. In this work, we perform an extensive
comparison of the two methods on three benchmarks covering 8 language
directions and multilingual training. Subword-level segmentation compares
favorably in all settings, outperforming its character-level counterpart in a
range of 1 to 3 BLEU points.

Comment: 14 pages single column, 4 figures, accepted for presentation at the
AMTA2020 research track
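The length trade-off between the two segmentations can be seen with a toy greedy longest-match segmenter (a BPE-style inference sketch; the vocabulary and matching scheme are illustrative, not the paper's setup): subwords yield shorter sequences while keeping the vocabulary open via a character fallback.

```python
def char_segment(word):
    # Character-level segmentation: open vocabulary, long sequences.
    return list(word)

def subword_segment(word, vocab):
    """Greedy longest-match subword segmentation: prefer the longest
    in-vocabulary piece at each position, fall back to a single
    character so no word is out of vocabulary."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces
```

Shorter target sequences mean fewer decoding steps, which is the training- and inference-time advantage that made subword segmentation the NMT standard.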
Instance-Based Model Adaptation For Direct Speech Translation
Despite recent technology advancements, the effectiveness of neural
approaches to end-to-end speech-to-text translation is still limited by the
paucity of publicly available training corpora. We tackle this limitation with
a method to improve data exploitation and boost the system's performance at
inference time. Our approach allows us to customize "on the fly" an existing
model to each incoming translation request. At its core, it exploits an
instance selection procedure to retrieve, from a given pool of data, a small
set of samples similar to the input query in terms of latent properties of its
audio signal. The retrieved samples are then used for an instance-specific
fine-tuning of the model. We evaluate our approach in three different
scenarios. In all data conditions (different languages, in/out-of-domain
adaptation), our instance-based adaptation yields coherent performance gains
over static models.

Comment: 6 pages, under review at ICASSP 2020
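The instance selection step at the core of the approach can be sketched as nearest-neighbor retrieval in the latent audio space: rank the pool by similarity to the query's latent representation and keep the top-k samples for the instance-specific fine-tuning pass. Cosine similarity and the pool layout here are assumptions for illustration, not necessarily the paper's exact procedure.

```python
import math

def cosine(a, b):
    # Cosine similarity between two non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve_similar(query_vec, pool, k=2):
    """Instance selection sketch: rank (latent_vector, sample) pairs by
    similarity to the query's latent audio representation and return
    the top-k samples for on-the-fly fine-tuning."""
    ranked = sorted(pool, key=lambda p: cosine(query_vec, p[0]),
                    reverse=True)
    return [sample for _, sample in ranked[:k]]
```

The retrieved samples are used for a few gradient steps on a copy of the model before translating the query, which is what "customizing the model on the fly" to each request means in practice.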